Roomba: Automatic Validation, Correction and Generation of Dataset Metadata
Authors
Abstract
Data is being published by both the public and private sectors and covers a diverse set of domains, ranging from life sciences to media and government data. One example is the Linked Open Data (LOD) cloud, which is potentially a gold mine for organizations and individuals who are trying to leverage external data sources in order to make more informed business decisions. Given the significant variation in size, languages used and freshness of the data, spotting spam datasets, or simply finding useful datasets without prior knowledge, is increasingly complicated. In this paper, we propose Roomba, a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. While Roomba is generic, we target CKAN-based data portals, and we validate our approach against a set of open data portals including the Linked Open Data (LOD) cloud as viewed on the DataHub. The results demonstrate that the general state of various datasets and groups, including the LOD cloud group, needs more attention, as most of the datasets suffer from poor-quality metadata and lack some informative metrics that are required to facilitate dataset search.
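The kind of metadata check the abstract describes can be illustrated with a minimal sketch. This is not Roomba's implementation. It assumes a dataset record shaped like the JSON that CKAN's `package_show` action returns (keys such as `title`, `notes`, `license_id`, `tags`, `resources`). The `REQUIRED_FIELDS` checklist and the `validate_dataset` function are hypothetical names introduced here for illustration only.

```python
from typing import Any, Dict, List

# Hypothetical checklist: these keys exist in CKAN's dataset (package) schema,
# but treating them as "required" is an assumption made for this sketch.
REQUIRED_FIELDS = ("title", "notes", "license_id", "author", "maintainer")


def validate_dataset(dataset: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in one CKAN-style dataset record."""
    problems = []
    # A field counts as missing if it is absent, None, or a blank string.
    for field in REQUIRED_FIELDS:
        value = dataset.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing field: {field}")
    if not dataset.get("tags"):
        problems.append("no tags")
    resources = dataset.get("resources") or []
    if not resources:
        problems.append("no resources")
    # Each resource should at least say where the data is and what format it has.
    for index, resource in enumerate(resources):
        if not resource.get("url"):
            problems.append(f"resource {index}: missing url")
        if not resource.get("format"):
            problems.append(f"resource {index}: missing format")
    return problems


# Made-up record used only to exercise the function.
example = {
    "title": "Example dataset",
    "notes": "",
    "license_id": "cc-by",
    "author": "Jane Doe",
    "maintainer": None,
    "tags": [{"name": "lod"}],
    "resources": [{"url": "http://example.org/data.rdf", "format": ""}],
}
print(validate_dataset(example))
```

The sketch only reports problems. Correcting or generating metadata, as the abstract mentions, would need further steps that are not shown here.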
Related Resources
Metadata Enrichment for Automatic Data Entry Based on Relational Data Models
The automatic generation of data entry forms from relational data models is a well-known idea that has drawn increasing discussion as agile software development methods have grown in popularity and programming tools have matured. One of the requirements of the automation methods, whether in commercial products or the relevant research projects, ...
What's up LOD Cloud? Observing The State of Linked Open Data Cloud Metadata
Linked Open Data (LOD) has emerged as one of the largest collections of interlinked datasets on the web. In order to benefit from this mine of data, one needs to access descriptive information about each dataset (or metadata). However, the heterogeneous nature of data sources reflects directly on the data quality as these sources often contain inconsistent as well as misinterpreted and incomple...
Learning Object Metadata and Automatic Processes: Issues and Perspectives
Generation of learning object metadata • Understanding the issues related with the generation of learning object metadata. • Identifying the opportunities and drawbacks of using automatic techniques for generating learning object metadata. Validation of learning object metadata • Understanding the validation of learning object metadata. • Identifying the opportunities and drawbacks of automatic...
Automatic Generation of Novel Intrusion Signatures Using One-Class Classifiers and Inductive Learning Methods
In this paper, we propose an approach for automatic generation of novel intrusion signatures. This approach can be used in the signature-based Network Intrusion Detection Systems (NIDSs) and for the automation of the process of intrusion detection in these systems. In the proposed approach, first, by using several one-class classifiers, the profile of the normal network traffic is established. ...
Functionalities for automatic metadata generation applications: a survey of metadata experts' opinions
This paper reports on the automatic metadata generation applications (AMeGA) project’s metadata expert survey. Automatic metadata generation research is reviewed and the study’s methods, key findings and conclusions are presented. Participants anticipate greater accuracy with automatic techniques for technical metadata (e.g., ID, language, and format metadata) compared to metadata requiring int...